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There have been several studies of the genome-wide temporal 
transcriptional program of viruses, based on microarray experiments, 
which are generally useful in the construction of gene regulation net- 
work. It seems that biological interpretations in these studies are di- 
rectly based on the normalized data and some crude statistics, which 
provide rough estimates of limited features of the profile and may 
incur biases. This paper introduces a hierarchical Bayesian shape re- 
stricted regression method for making inference on the time course 
expression of virus genes. Estimates of many salient features of the 
expression profile like onset time, inflection point, maximum value, 
time to maximum value, area under curve, etc. can be obtained im- 
mediately by this method. Applying this method to a baculovirus 
microarray time course expression data set, we indicate that many 
biological questions can be formulated quantitatively and we are able 
to offer insights into the baculovirus biology. 

1. Introduction. 

1.1. Transcription program of virus. With a custom made baculovirus 
DNA microarray, Jiang et al. (2006) investigated the temporal transcription 
program of one of the best characterized baculoviruses, AcMNPV, in its 
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host lepidopteran Sf21 cells. They uncovered sequential viral gene expres- 
sion patterns, which are possibly regulated by different mechanisms during 
different phases of infection, compared the transcription profile of a mu- 
tant virus with that of the wild type, and suggested that the array strategy 
taken in the study points to a very productive direction for constructing a 
baculovirus gene regulation network. 

The experiments of Jiang et al. (2006) are briefly summarized as follows. 
They use single color cDNA microarray experiments with external controls 
for data normalization. Each chip has exactly four spots for each of the 156 
open reading frames, referred to as genes henceforth, of baculovirus; total 
RNA samples of baculovirus genes were taken at several different time points 
during the 72 hours following infection; the sample for each time point is 
hybridized to a single chip. The normalized time course expression data are 
shown to be in good agreement with those obtained by the real-time PCR 
method for five randomly chosen genes; the data for each gene used in the 
study of temporal transcription is based solely on the normalized expression 
levels at these time points and on the crude estimates of its onset time and 
the time that its expression attains its maximum. 

A rough idea regarding virus gene expression is that genes of a virus 
have their time course expression level being zero initially, then increasing 
after a while and finally decreasing; because viruses do not have their own 
machinery for gene transcription, their genes start to express only after 
getting into cells, and cells may eventually malfunction when infected. It is 
of interest and feasible to make use of this idea to profile the time course 
expression of each virus gene, based on microarray data, to estimate salient 
features of the profile like onset time, time to maximum value, maximum 
value, area under the profile curve, etc. and to test the shape hypotheses 
on the profile curve like unimodality on certain time intervals. It is hoped 
that this approach to gene expression analysis of viruses would eventually 
provide a sound basis for the study of the temporal transcription program 
of viruses. 

The purpose of this paper is to propose a Bayesian shape restricted re- 
gression model based on the above property of a virus, illustrate this model 
by profiling the time course expression of genes of baculovirus, and indicate 
that this approach does provide more insights into baculovirus, compared 
with the crude statistics used in Jiang et al. (2006). Among others, a promi- 
nent example in this regard is that this new approach seems to support the 
widely accepted conjecture that structural genes of the virus may have a 
larger amount of total expression level, which is hard to examine by the 
method in Jiang et al. (2006). 

This method is illustrated on the dataset for the baculovirus Bac-PH- 
EGFP in Jiang et al. (2006). With 16 time points, this dataset seems to hold 
a promising opportunity to capture the main features of the transcription 
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profile. We note that the other two datasets in Jiang et al. (2006) have only 
6 time points and 5 of them are in the initial two hours post infection and 
it is hard to infer some of the main features of the profile based on them. 

Because microarray experiments offer feasible approaches to the studies 
of the genome- wide temporal transcriptional program of viruses, which are 
generally useful in the construction of gene regulation network, there have 
been many genome-wide expression studies of virus genes. See, for example, 
Yang et al. (2002), Iwanaga et al. (2004), Duplessis et al. (2005), van Mun- 
ster et al. (2006), Majtan et al. (2007), Smith (2007) and references therein; 
they considered different viruses and/or different host cells. It seems that 
all the biological interpretations in these studies are directly based on the 
normalized data and crude statistics, which seem to provide only naive es- 
timates of limited features of the profile, and there are some discrepancies 
reported in the literature; see, for example, Smith (2007). It is of great in- 
terests to compare the transcriptional studies based on different but related 
strains of viruses and/or different and related host cells so as to build a 
gene regulation network. We note that comprehensive comparisons depend 
on comprehensive and rigorous time course expression profiling of genes in 
each study. The focus of this paper is the latter. 

1.2. Statistical modeling strategy. Preliminary examination of the Bac- 
PH-EGFP data suggests that two of the 156 genes seem to have their expres- 
sion levels being zero finally as well as initially and the rest of the 154 genes 
being zero only initially, probably because no data were taken at time point 
beyond 72 hours and the life cycle of baculovirus is longer than 72 hours, 
according to Friesen and Miller (2001). To make the presentation concise, 
we limit our attention to these 154 genes in this paper; the other two genes 
can be studied similarly. 

Let A denote the set of all smooth functions on [0, 1] that are zero ini- 
tially, start to increase after a while, and stay positive onward. The task of 
profiling the time course expression level of virus genes will be considered 
a shape restricted regression problem with the regression function belong- 
ing to A. Let g = 1,2, . . . , 154 index the 154 genes of the baculovirus. For 
g = 1, . . • , 154, we assume that, given F g in A, 



Here {Xk \ k = 0,...,K} are constant design points in [0,1], {Yjkg \ j = 
1, . . . , rrik, k = 0, . . . , K, g = 1, . . . , 154} are response variables, and for every 
j = 1, . . . , m^, k = 0, . . . , K, g = 1, . . . , 154, ej^g are independent normal errors 
with mean ji g and variance 
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In this paper represents a time point at which the mRNA sample 
is taken for microarray experiments; Yjkg is the expression level, in terms 
of fluorescent intensity, obtained at the jth spot of the gth gene for the 
sample taken at time point X}.. More specifically, in our data, let [0,1] de- 



note the time period of 72 hours, then K = 15, = 4, (Xq,Xi, . . . , X15) = 
(0, 1/216, 1/108, 1/72, 1/36, 1/24, 1/12, 1/8, 1/6, 5/24, 1/4, 1/3, 5/12, 2/3, 5/6, 
!)• 



The variance structure in (1.2) is a simple way to take into considera- 
tion the observation that for single color cDNA microarray experiments, 
larger intensities often incur larger variances when considering replicates. 
The reason for not assuming Sj^g having zero mean is that there are always 
background intensities due to nonspecific hybridization and, hence, E(Yjk g ) 
may not be zero even when the expression level F g {Xk) is zero. 

We now explain that Bernstein polynomials can be used to study the 
above shape restricted regression model. For integers < i < n, let (fi t n(t) = 



C"f(l - t) n ~\ where Cf = n\/(i\{n - The set {ip^ n \ % = 0, . . . ,n} is 
called the Bernstein basis for polynomials of order up to n. Let B = [0, 1] x 
UT =3 ({n} xl"" 1 ). Define F:Bx [0,1] — > R 1 by 



where (c, n, 62 n , ■ ■ ■ , b n n ) € B and t S [0, 1]. We also denote (1.3) by F c b n (t) 
if b n = (b 2 , n , . . . , b n>n ). We will see in Section 2 that i^ c ,b n (') is a member 
of A if < min; = 2 i ... in b^ n < max; = 2 v .. jn b^ n , and every member of A can be 
approximated by F c ^ n {-) satisfying these restrictions on b n . This observation 
suggests that, by means of (1.3), Bernstein polynomials form a useful tool 
to introduce priors on A for a Bayesian analysis. 

We will consider Bayesian hierarchical models based on (1.3). With pri- 
ors on a space of smooth functions satisfying certain shape restrictions and 
parameters in the priors based on crude estimates from data, our approach 
has the advantage of utilizing prior knowledge from biology; with 154 cor- 
related and possibly similar profiles to study, hierarchical regression models 
take advantage of the possibility of data driven shrinkage- type estimates. 

We note that Bayesian shape restricted inference with priors introduced 
by Bernstein polynomials was studied by Chang et al. (2005), which pro- 
vides a smooth estimate of an increasing failure rate based on right censored 
data, and by Chang et al. (2007), which compares the Bernstein polyno- 
mial method with the density-regression method [Dette, Neumeyer and Pilz 
(2006)] in estimating an isotonic regression function and a convex regression 
function. It was also shown there that these Bayesian estimates perform 
favorably, in addition to the facts that these priors easily take into consid- 
eration geometric information, select only smooth functions, can have large 



(1.3) 
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support, and can be easily specified. We note that Petrone (1999) made use 
of these nice properties in her study of random Bernstein polynomials and 
for sampling the posterior distribution, proposed algorithms that regards 
the construction of the Bernstein-Dirichlet prior as a histogram smoothing. 

The present paper indicates that the expression profiles of virus genes can 
also be efficiently studied by random Bernstein polynomials, making use of 
the shape restrictions described above. We will estimate salient features of 
the profile like onset time, inflection point, maximum value, time to maxi- 
mum value, area under the profile, etc., utilizing the fact that the derivative 
of a polynomial has a closed form. We will also test the hypothesis on the 
shape of the time course expression profile; for example, we will examine 
whether it is unimodal on the region [0,r] for some r < 1. In fact, by cal- 
culating both the posterior probability and the prior probability that it is 
unimodal on [0,r], we offer an assessment of the strength of the evidence 
in favor of the hypothesis. We note that this direct approach to hypothe- 
sis testing is markedly different from the frequentist p-value approach, as 
discussed in Kass and Raftery (1995) and Lavine and Schervish (1999), for 
example. 

There is a large literature on shape restricted inference since Hildreth 
(1954) and Brunk (1955). Most of them treat isotonic and concave regres- 
sions from the frequentist viewpoint. Readers are referred to Gijbels (2003) 
for an excellent review and to Dette, Neumeyer and Pilz (2006) for some 
of the more recent developments. For the Bayesian approach, there are the 
works of Lavine and Mockus (1995), Dunson (2005) and Chang et al. (2007), 
among others. This paper illustrates the use of the Bernstein polynomial in 
investigating the strength of the evidence provided by the data in favor of 
the hypothesis on the shape of the regression function, in addition to its use 
in estimation. 

This paper is organized as follows. Section 2 presents the Bernstein poly- 
nomial geometry and the hierarchical regression model. Algorithms for 
Bayesian inference are given in the Appendix. Section 3 illustrates the method 
by simultaneously analyzing all the data for these genes and indicates that 
this method does bring insights into baculovirus biology. Section 4 concludes 
with a brief discussion. 

2. Bayesian inference. 

2.1. Bernstein polynomial geometry. Let F c ^ a {t) = Y^i=o a i L Pi,n(jE^ K(c,i] (*)> 
where a = (ao, . . . ,a n ). Proposition 1 provides a sufficient condition on a 
under which F ca is in A. Proposition 2 complements Proposition 1 and pro- 
vides Bernstein- Weierstrass type approximations for functions in A. In this 
paper derivatives at and 1 are meant to be one-sided. All the proofs of the 
propositions in this paper are omitted, because they are similar to those in 
Chang et al. (2005) and Chang et al. (2007). 
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Proposition 1. Let n > 3 andc e [0, 1). IfO = do = a\ < Hihn. = 2,... in az < 
max/ = 2 i ... i n a\, then F CjCl is continuously differentiable, constantly on [0, c], 
and larger than on (c, 1). 

Let /„ = {F^ a | c G [0,1), a = (a ,...,a n ) satisfying = a = a\ < 
min/ = 2 i ... i n o,i < max; = 2,... in a;}- For two continuously differentiable functions 
/ and /, define e(f,f) = ||/-/||oo + ||/'-/ ||oo, where /' denotes the deriva- 
tive of /, and || • | loo is the sup- norm for functions on [0, 1]. Then we have 
the following: 

Proposition 2. Let V = \J^ =z I n - Then V is dense in A, under e. 

2.2. Bayesian regression model. 
(i) Hierarchical prior 

For each g = 1, . . . , 154, we will introduce probabilities ir g on A as follows. 
We first describe the framework and then the specific priors to be used. Let 
iri^g be a probability density function on [0, 1] , meant to be the prior on 
the onset time c of gene g; it2,g be a probability mass function on the set 
of positive integers {3,4,...}; for each n, vr3 i9 (-|n) be a probability density 
function on W l ~ l of b n . The probability density/mass functions 7i"i j0 , 7T2 jS and 
7T3 i5 jointly define a probability TT g on £> by the product vri >9 (c) x 7T2 i9 (n) x 
^^(frnl^); this in turn defines a probability measure on A by (1.3). Let Tx^ g 
be a probability density on M 1 for n g , the mean of £jkg- Then -K g = TT g x n^ g 
is the prior density we will use on 6 x M 1 . 

We now describe the strategies to specify 7Ti i9 , iT2,g, ^3,g and Tt4 >g - Because 
our preliminary studies based on a single gene suggest that the posterior dis- 
tributions of several features do not vary much with the prior order of the 
Bernstein polynomial so long as it is not too small, we take 7T2 j9 to have 
probability 1 for n = 15, which has the advantage of lessening the computa- 
tional burden. The priors 7Ti, 9 , vt3 i9 and 7r^ g are defined in the following by 
crude estimates based on all the 154 genes. 

For each g = 1, . . . , 154, let Y^ g < Y^ g <■■• < ^(is) 5 be the order statis- 
tics for {Y 0g , Y lg , . . . , Y 15g }, where Y kg = Y,j=i Y jkg/^- The prior 7r 4i9 is the 
uniform distribution on [0, 2Y(jg]- 

We now define 7Ti >9 for onset time. Let k(g) be the integer such that 
Y ~k{g)g = y (i5)g; let Kd) = max{/c | k = 0, 1, . . . , k(g) satisfying Y kg < 2Y 0g } + 
1. Let X g = X fc(9) and X g equals X^ + ^y 2 if (k(g) + k(g))/2 is even, and 
equals ^rk(g)+k(g)+i)/2 otherwise. Let ct\ and be chosen so that the beta 
distribution Beta{a\,a2) has mean X^=i(^9/^s)/154 and variance 



( max {X g /X g }- nun {X g /X g })/4 

\g=l,... ,154 o=l,. ..,154 / 
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Let 4>n = a\ —0.5, 0i2 = ct\ +0.5, 02i = «2 — 0.5 and 022 = «2 + 0.5. We note 
that for the present dataset, a\ = 2.7771 and a 2 = 2.4481, thus, 0n = 2.2771, 
^12 = 3.2771, 02i = 1-9481 and 22 = 2.9481. 

Let 0i and 02 be two random variables having distributions respectively 
C/m/orm(0n, 0i2) and Uniform(4> 2 i , 022)- Let Ui, . . . , be a random sam- 
ple of size 154 such that the conditional distribution of U g given 0i and 02 
is i?eta(0i,02) for each g = 1,...,154. We assume that conditional on 0i 
and 02, the prior density 7Ti >9 of the onset time of gene g is the probability 
density function of X g x (7 9 . In particular, we assume the onset time is in 
the interval [0,X 9 ]; this assumption results from examining the data closely. 

We next define 7T3 j9 (-|n), which takes into consideration the range of the 
observed expression levels and is motivated by the propositions in Section 
2.1. Let Yj[ k q g = Y jkg , if Y^ g = Y kg . Denote by Y (1[k])g < Y {2[k])g < Y {3 [ k])g < 
Y(A[k]) g the order statistics of {Y l[k]g ,Y 2 [ k]g ,Y 3[k]g ,Y 4[k]g }. Let 3 and 4 be 
two random variables having distributions respectively Uniform (03i, 032) 
and Uniform(4>4i, 042), where 03i, 032, 04i and 042 are constants to be as- 
signed later. Let V 2 ^ g , • • • , V15 g be a random sample such that the conditional 
distribution of each Vi :9 given 03 and 04 is Beta(4> 3 ,4>4). We assume that 
conditional on 03 and 04, the prior density function -K 3tg (-\n) of the coef- 
ficients b n ^ g = (&2,n,gj • • • j b n ,n,g) is the joint probability density function of 
2y (4[i5])g • (V2, g , ■ ■ In the present study, 3 i = 4i = 0.5 and 32 = 

042 = 1.5, which give a large support of the prior. Let = (01,02,03,04), 
which are the hyperparameters. 

Thus, under the assumption that (c g ,n,b n! g, /jL g ) ^BxR 1 are condition- 
ally independent given 0, the posterior density v of all the parameters and 
hyperparameters, given the data, is proportional to 

( 154 K m k \ 

(2-1) < ]J II W~9kg {Y jk g - F CgMg (X k ))ir g (c g ,n,b ntg ,n g \<j)) I x -0(0), 
{ g =lk=0j=l J 

where gk g is the normal density of Ejk g specified in (1.2) and V>(0) — 11^=1(^2 — 
0ii) _1 is the joint hyperprior density function, 
(ii) Sampling the posterior distributions 

Based on the hierarchical model, we use a Metropolis-within-Gibbs al- 
gorithm to generate the posterior distributions for inference; details of the 
algorithm are in the Appendix. The software is written in Matlab, which is 
available from the author upon request. The variance o~ kg in (1.2) to be used 

in the algorithm is decided as follows. Let a kg = Yl^=i(Yjkg — Ykg) 2 /3 and 

i g € {0,1,2} be the number that minimizes L(^ g ) = Ylrk=o(Qkg — Q g ) 2 /^ 

with Q kg = °\JY% and Q g = EHoQkg/^ for £ g = 0, 1 and 2. With x<*> 
denoting the current state of the Markov chain and fi g the background noise 
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in the current state we use 

ol g = d 2 g (F g {X k ) + (i g f* 

for the a\ g in (1.2) when updating x^ +1 \ where a 2 g = Y.kLoi^lg/^kg)/ 16 
and F g is the F g determined by x^> . 

We run 5 MCMC chains with initial values chosen randomly from the 
hyperpriors and the priors of each gene g, and monitor convergence by the 
Gelman-Rubin statistic R, following the suggestion in Gelman and Rubin 
(1992) and Gelman et al. (2004), pages 294-297. For each of the 154 genes, 
the Gelman-Rubin statistics R is calculated for six estimands of interest, 
which are onset time (Ton), time to maximum (Tmax), maximum (Max), 
time at which the slope is the highest (Tslope), the highest slope (Slope) 
and the area under the curve on [0,1]. Each of the five chains is run with 
126,000,000 MCMC iterations and with a burn-in period of 12,600,000 it- 
erations, in which almost all the R are less then 1.1. The 56,700 updates, 
collected by taking one for every 10,000 updates in the last 90% of updates of 
these 5 sequences, are considered the sample from the posterior distribution, 
which form the basis for inference. 

(hi) Numerical performance 

To evaluate the numerical performance of the above hierarchical Bayesian 
method, we studied a similar, but not hierarchical, Bayesian method for the 
analysis of the time course expression of a single virus gene. This nonhier- 
archical Bayesian method, modeling the expression profile also by Bern- 
stein polynomials, is more flexible in the sense that it allows nontrivial prior 
probability on the order of the Bernstein polynomial and is amenable to 
simulation studies. In fact, the simulation studies indicate its excellent nu- 
merical performance. Details of this method and the simulation studies are 
in the supplementary article [Chang et al. (2008)]. We will evaluate the per- 
formance of the hierarchical Bayesian method by comparing it with that 
of the nonhierarchical Bayesian method, in the context of analyzing our 
baculovirus expression data. The genes that we chose to conduct this evalu- 
ation are selected by the criterion described in the following paragraph; this 
choice serves also the purpose of comparing the results from our hierarchical 
Bayesian method and that in Jiang et al. (2006), in addition to evaluating 
the numerical performance of our method. 

For each gene, we consider the differences between the times obtained 
from the hierarchical Bayesian method and those in Jiang et al. (2006). Fig- 
ure 1 gives a rough idea of the differences. The first (second) coordinate of 
a dot in Figure 1 is the onset time (time to maximum) of a gene obtained 
from the hierarchical Bayesian method minus that of the same gene using 
the naive method. A gene is selected if either its difference in onset times 
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Table 1 

Estimates of the onset time based on the naive estimate, 
hierarchical Bayesian method and Bayesian method 





Jiang et al. 


Hierarchical 








(2006) 


Bayesian 


Bayesian 


ID (Name) 


Estimate 


Mean 


Stdv 


Mean 


Stdv 


ID 130 (plO) 


0.0697 


0.0724 


0.0080 


0.0756 


0.0043 


ID 143 \pe38) 


0.0335 


0.0293 


0.0110 


0.0211 


0.0119 


ID 145 (pk-1) 


0.1785 


0.1931 


0.0025 


0.1930 


0.0025 


ID 146 (pk-2) 


0.0552 


0.0349 


0.0091 


0.0292 


0.0104 


ID 152 (v-cath) 


0.1374 


0.1836 


0.0072 


0.1802 


0.0095 



or that in times to maximum is larger than 10 hours; we note that a dif- 
ference of this size may cause concerns in biological interpretation. There 
are in total five such genes and their differences in onset times are not as 
large as their differences in the time to maximum; we carry out time course 
expression for these five genes separately by the nonhierarchical Bayesian 
method. The onset times and the times to maximum of these five genes 
are shown in Table 1 and Table 2 respectively. The first column of Table 1 




Fig. 1. The differences between the Ton (Tmax) based on the hierarchical Bayesian 
method and the crude estimate. The first (second) coordinate of a dot is the onset time 
(time to maximum) of a gene obtained from hierarchical Bayesian method minus that of 
the same gene using naive method. 
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gives the ID and the name of these genes; column 2 gives the onset times 
from Jiang et al. (2006); column 3 gives the means and standard deviations 
(Stdv) of the posterior distributions of the onset times from the hierarchical 
Bayesian method; column 4 gives those from the nonhierarchical Bayesian 
method. The entries in Table 2 bear similar meanings as those in Table 1. 
It is clear from these tables that the results from the hierarchical Bayesian 
method and those from the nonhierarchical Bayesian method are in quite 
good agreement. This suggests that the hierarchical Bayesian method seems 
to produce reliable results in the study of baculovirus gene expression. 

We note that one of the genes, ph, was knocked out and we included 
it in the hierarchical Bayesian analysis as a way to see if our method is 
capable of identifying it. Indeed, it does; it has its time course expression 
profile much lower than all the others; details are omitted. We also note that 
we compared other features of several genes obtained from the hierarchical 
Bayesian method and those from the nonhierarchical Bayesian method and 
find them in very good agreement. To shorten the paper, we do not report 
the comparison. 

One referee raised the question of whether our procedure automatically 
identifies genes having different shapes like the two singled out by initially 
examining the data. Indeed, based on the posterior distributions, we get 
these two genes identified by performing posterior predictive checking, as 
described in Gelman (2003) and Gelman, Meng and Stern (1996). 

3. Applications to the baculovirus data. Based on the samples from 
the posterior distribution obtained in Section 2, this section carries out a 
genome- wide expression analysis of the baculovirus and compares the results 
with those in Jiang et al. (2006). It seems that the method of this paper re- 
veals more insights into virus biology than the naive method and in case 
the results from this paper and those in Jiang et al. (2006) are significantly 



Table 2 

Estimates of the time to maximum based on the naive estimate, 
hierarchical Bayesian method and Bayesian method 





Jiang et al. 


Hierarchical 








(2006) 


Bayesian 


Bayesian 


ID (Name) 


Estimate 


Mean 


Stdv 


Mean 


Stdv 


ID 130 (plO) 


0.7343 


0.5855 


0.0079 


0.5859 


0.0068 


ID 143 \peS8) 


0.2185 


0.3536 


0.0248 


0.3479 


0.0230 


ID 145 (pk-1) 


0.3515 


0.5293 


0.0046 


0.5285 


0.0051 


ID 146 (pk-2) 


0.2127 


0.4163 


0.0179 


0.4171 


0.0166 


ID 152 (v-cath) 


0.3564 


0.4990 


0.0082 


0.4924 


0.0101 
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different, it is more often than not that the results from this paper are in 
better agreement with biology. Since one of the genes, ph, was knocked out, 
the analysis in Jiang et al. (2006) was based on 155 genes and the following 
studies regard the expression of the 153 genes. 

3.1. Times to maximum. According to Table 2, the differences in times 
to maximum for 5 genes are larger than ten hours. Except for the gene plO, 
our method gives larger times to maximum. The following comments seem to 
suggest that the times to maximum from the current approach allow better 
or equally sensible interpretation, based on their gene product function. 

pe38 encodes a transcription factor important for virulence of the bac- 
ulovirus [Milks et al. (2003)]. It was shown that it expresses from the imme- 
diate early phase throughout the late phase [Knebel-Morsdorf et al. (1996)]. 
Larger time to maximum might reflect this fact more satisfactorily. 

pk-1 is a component of AcMNPV very late gene transcription complex 
[Mishra, Chadha and Das (2008)]. Reilly and Guarino (1994) indicated that 
the transcription of pk-1 peaks in the very late stage of the infection cycle. 
Larger time to maximum seems more consistent with these observations. 
Although there is no report on the transcription time of pk-2, we tend to 
think that it is similar to pk-1 and hence transcribes also in the late stage 
of the infection cycle. 

v-cath encodes a papain type cysteine proteinase with cathapsin L-like 
property. Its proteinase activity is required for the breakdown of host tis- 
sues during the later stages of virus infection/pathogenesis [Hill, Kuzio and 
Faulkner (1995)]. Larger time to maximum better reflects the needs for its 
protein expression during this stage, when the host has been exhausted com- 
pletely and the virus can be spread to other hosts most efficiently. 

For the well-known late gene plO, although the hierarchical Bayesian 
method gives a smaller time to maximum than that in Jiang et al. (2006), 
we note that this smaller time to maximum is still the third largest among 
all the times to maximum of the 153 genes and hence seems to cause less 
concern. 

3.2. Time course expression analysis. To illustrate the use of our method, 
we now present, in Table 3, the features of the expression profile of the gene 
v-cath, which is one of the genes selected to evaluate the numerical perfor- 
mance of our method. Figure 2 presents the data and the posterior mode 
of its time course expression. Most of these features can not be reliably ob- 
tained by the naive method. This illustration also helps to appreciate that 
the data have substantial contribution in the inference on these features of 
v-cath. Table 3a reports the posterior probability and the prior probabil- 
ity that the parameter represents a unimodal curve on the interval [0, r] for 
r = 0.6667, 0.8333, 1.0000; the last two rows give respectively the ratio of the 
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posterior probability to the prior probability and the Bayes factor. Table 3a 
presents strong evidence, provided by the data, in favor of the unimodality 
of the time course profile. The posterior probability and the prior probability 
that the parameter represents a curve that is increasing before reaching its 

Table 3 
Data analysis for the gene v-cath 

Table 3a. Posterior probability (Po), prior probability (Pr), the ratio of Po to Pr, and the 
Bayes factor (Bf) of being unimodal on [0,t]. 



[0,T] 


[0, 0.6667] 


[0, 0.8333] 


[0, 1.0000] 


Po 


1.0000 


0.3280 


0.0280 


Pr 


0.4158 


0.2658 


0.0972 


Po/Pr 


2.4050 


1.2340 


0.2881 


Bf 




1.3482 


0.2676 



Table 3b. Posterior probability (Po), prior probability (Pr), the ratio of Po to Pr, and the 
Bayes factor (Bf) that it is increasing before reaching its global maximum. 



Po 


1.0000 


Pr 


0.3719 


Po/Pr 


2.6889 


Bf 


oo 



Table 3c. The Ton, Tmax, Max, Tslope, Slope, Li-norm and Tend of the mode of the 
posterior density v in (2.1) is given in the third column in the table. The sample mean, 
sample Stdv and support of the posterior probability distribution and the prior probability 
distribution of these features are respectively given in the fourth, fifth and sixth column. 



Estimand 




Mode 


Mean 


Stdv 


Support 


Ton 


Posterior 


0.1819 


0.1836 


0.0072 


(0.1197, 0.2079) 




Prior 




0.1329 


0.0510 


(0.0023, 0.2498) 


Tmax 


Posterior 


0.5093 


0.4990 


0.0082 


(0.4444, 0.5231) 




Prior 




0.7902 


0.2278 


(0.2083, 1.0000) 


Max 


Posterior 


1.7779 


1.6797 


0.0877 


(1.2713, 1.9397) 




Prior 




2.1139 


0.5189 


(0.3668, 3.0793) 


Tslope 


Posterior 


0.2176 


0.2358 


0.0420 


(0.1944, 0.4074) 




Prior 




0.4236 


0.3499 


(0.0509, 1.0000) 


Slope 


Posterior 


8.1613 


8.7394 


1.0778 


(5.6079, 13.5625) 




Prior 




15.0351 


9.1052 


(1.7144, 59.4057) 


Li-norm 


Posterior 


0.6206 


0.6005 


0.0298 


(0.5004, 0.7357) 




Prior 




1.0878 


0.3419 


(0.1173, 2.2368) 


Tend 


Posterior 


0.8380 


0.8400 


0.0720 


(0.7500, 1.0000) 




Prior 




0.9366 


0.1217 


(0.3611, 1.0000) 
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v-cath 

2 I 1 1 1 1 1 




Time 

Fig. 2. The data and the posterior mode of the time course expression of the gene v-cath. 

global maximum are reported in Table 3b; similarly, the last two rows give 
respectively the ratio of the posterior probability to the prior probability 
and the Bayes factor; Table 3b strongly suggests that the expression profile 
increases before its global maximum. 

Let To (Tend) denote the largest time point t such that the time course 
expression profile is unimodal on [0,i\. Let Li-norm denote the area under 
the time course expression profile on [0, To]. Table 3c reports Ton, Tmax, 
Max, Tslope, Slope, Li-norm and Tend of the mode of the posterior den- 
sity v in (2.1) and the sample mean, sample standard deviation (Stdv) and 
support of these features on the sample respectively from the posterior and 
prior distributions. Comparing the Stdv and the support from the posterior 
and the prior, we know that the data have substantial contribution in the 
inference on these features. 

It is customary in microarray literature to cluster genes according to their 
expression profiles for biologists to use. Using the Ton and Tmax of the 
mode of the posterior distribution, we apply the cluster analysis algorithm 
proposed by Hall and Heckman (2002) to cluster the 153 genes into six 
groups, which are I (early onset and early to maximum), IV (mid-course 
onset and early to maximum), V (late onset and mid-course to maximum), 
VI (late onset and early to maximum), and II and III (mid-course onset and 
late to maximum). The scatterplot in Figure 3 reports the cluster analysis 
result; genes with known functions are listed according to the clusters to 
which they belong. 
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Group I (early onset and early to maximum) 35K/p35, egt, me53, 
39K/pp31, pcna, 94K, ie-2, lefl, pnk/pnl, he65, ie-01, ie-1, lef6, pk-2, 
DNA-pol, gp64, pe38, lef3, p48, lefl, p26, ctx, helicase, lefll, lef2, pl5, tip, 
orf-603 

Groups II and III (mid-course onset and late to maximum) orf-1629, 
plO, p74 

Group IV (mid-course onset and early to maximum) 
p43, alk-exo, cg30, odv-el8, PE/pp34 

Group V (late onset and mid-course to maximum) % 

Group VI (late onset and early to maximum) gp41, 
chitinase, ie-0, pkip, sod, lef9, odv-ec27, lef5, env-prot, lef4, lef8, p95, vp39, 
gpl6, 38K, bro, fgf, fp, HisP, iap2, odv-e56, v-ubi, 49K, odv-e25, vp80, gp37, 
leflO, p24, odv-e66 



gta, p40, ptp, iapl, 



v-cath 
P 47, p6.9, 



vlf-1, 



Fig. 3. A classification methodology for the 153 genes based on Ton and Tmax. Selected 
known genes in each classified group are listed at the bottom. 
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Table 4 

Mean and standard deviation of the rank correlation of 



the time 


course expression 


of two genes chosen 


from 




specific 


groups 








Rank correlation 


Group 


Number of genes Mean 


Stdv 


I 


60 


0.8070 


0.1578 


II 


2 


0.9793 


0.0000 


III 


1 


NA* 


NA 


IV 


15 


0.8852 


0.0807 


V 


6 


0.9108 


0.0594 


VI 


69 


0.8981 


0.0867 


All 


153 


0.7717 


0.2023 



*NA means not applicable. 



While Figure 3 helps to shed light on the gene groups, it would be interest- 
ing to see if genes in the same group have a more similar overall expression 
profile. Using the rank correlation of two time course expression profiles as 
the distance between two genes, Table 4 shows that the means of the rank 
correlation for two genes randomly chosen from the same one of the clusters 
are smaller than that from the set of all 153 genes. We note that the rank 
correlation is a measure of similarity between functions studied by Heckman 
and Zamar (2000). This seems to suggest that genes in the same group have 
a more similar expression profile. 

Based on the time course expression profile of the 153 genes obtained by 
the posterior mode, we use the K-means algorithm along with the sample 
rank correlation matrix to cluster them; as in Jiang et al. (2006), we also 
consider five clusters. The five gene clusters are contained in Figure 4. 

We note that clustering is an important step toward gaining insights from 
high-throughput expression data and there is usually some arbitrariness in 
forming clusters. Since clustering in Figure 3 is based only on onset times 
and times to maximum, it is easier to cluster and to interpret, but Figure 4 
is more informative in general. For example, Cluster 5 in Figure 4 consists of 
three genes; one of the most obvious features of these three genes seems to 
be their large expression levels; thus, it is interesting to note that they are 
also in such close proximity to each other in Figure 3 and they form exactly 
the Groups II and III in Figure 3. 

3.3. Total expression amount and structure genes. It is of great interest 
to study the widely discussed conjecture that the virus has a great demand 
of structural proteins. While we cannot provide a definitive answer to this 
question, we think the method of this paper can shed some light on it. One 




of the salient features of the expression profile obtained by our method is the 
area under the time course expression profile (Li-norm); roughly speaking, 
the Li-norm of a gene is the sum of the lives of all the mRNA molecules 
transcribed during the time interval ended at Tend; the life of an mRNA 
molecule is the time length from its transcription to its degradation or its 
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Table 5 

Motifs have to do with onset time. Comparing the onset times of genes having 
specific motifs with those without by the Wilcoxon statistic, which is 
asymptotically standard normal. Minus (plus) values indicate the former 
(latter) is smaller (larger) 



Motif 


With 


Without 


Wilcoxon statistic 


p- value 


Early* 


64 


66 


-2.65 


0.00402 


TAAG 


70 


60 


4.04 


0.00003 


CATG 


69 


61 


-2.66 


0.00391 


Early/CATG 


110 


20 


-2.54 


0.00554 



*The early motif (Early) consists of motifs A(A/T)CGT(G/T) and CGTGC. 



Tend. Although the relation between the Li-norm and the total number of 
the proteins translated is complex, we expect they are positively correlated. 
We indicate in the following that structure genes seem to have larger L\- 
norms. There are 74 baculovirus genes with known names, in which 15 of 
them are structure genes and the rest are not. We find that, in terms of 
the Li-norm, four of the five largest genes are structure genes, giving an 
odds ratio of 21.1; among the ten largest genes, five of them are structure 
genes, giving an odds ratio of 5.4; among the 20 largest genes, 7 of them are 
structure genes, giving an odds ratio of 3.1. We also study by the Wilcoxon 
statistic the null hypothesis that there is no difference in the Li-norm be- 
tween structural genes and nonstructural genes. We find the statistic has 
value 1.73 and using the one-sided Wilcoxon test, it has p-value 0.0418. 
This seems to reinforce the conjecture that structural genes tend to have 
larger Li-norms. We note it seems hard to estimate the Li-norms and to 
study this conjecture by the method of Jiang et al. (2006). 

3.4. Motif and onset time. Biologists tend to think that genes participat- 
ing in the same biological process may be transcriptionally coregulated. One 
preliminary step in studying this phenomenon might be to examine whether 
upstream sequence motifs of a gene have something to do with its tran- 
scription time. In the baculovirus literature [Ayres et al. (1994) and Friesen 
and Miller (2001), for example], motifs A(A/T)CGT(G/T) and CGTGC 
are called the early motif; motif TAAG is called the late motif; genes hav- 
ing motif CATG are usually believed to express early. Jiang et al. (2006) 
studies this by reporting the proportions of these motifs in the 5 gene clus- 
ters obtained from clustering the time course expression crude data. While 
we can conduct a similar study by means of the clusters obtained from our 
Bayesian method, we propose to ignore the clusters and take a more direct 
and relevant approach to address this issue. 
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Based on the onset times of this paper, we study the hypotheses that, 
with a given motif, there is no difference between the onset times of the 
genes with this motif and those without this motif. We study them by the 
Wilcoxon statistic. Table 5 summarizes the numbers of genes having or not 
having these motifs and reports the Wilcoxon statistics and their p-values 
for testing the corresponding one-sided null hypothesis. For example, the 
second row shows that 70 genes have TAAG and 60 genes do not have it, 
its Wilcoxon statistic is 4.04 and the p- value is smaller than 0.0001, which 
seem to suggest that the genes having TAAG tend to have later onset times. 
It seems Table 5 supports the idea that motifs have something to do with 
onset times. 

3.5. Colocalization. Because functionally correlated or coregulated genes 
in an operon of a bacterial genome may be located in nearby loci of the phys- 
ical genome [Lagreid et al. (2003)], Jiang et al. (2006) investigated whether 
a similar gene organization exists in the AcMNPV genome. Based on the 
time course expression normalized data, Jiang et al. (2006) clustered genes 
into five clusters and noted six colocalized clusters. A colocalized cluster is 
defined as a genome region that contains at least five consecutive genes from 
the same gene cluster where no more than one interruption occurs by a gene 
from other gene clusters in either the plus or minus strand. Using the same 
definition of a colocalized cluster, we find there are nine colocalized clusters, 
based on the five clusters exhibited in Figure 4. These nine colocalized clus- 
ters are shown in Figure 5. This seems to suggest that expression profiles 
from our sophisticated method reveals more signals than the naive method. 

The phenomenon that genes with similar expression profile tend to be 
located near each other is referred to as colocalization in Jiang et al. (2006). 
Since the above definition of a colocalized cluster is somewhat arbitrary, we 
present a more systematic study on this in Table 6. Column two and column 
three of Table 6 give respectively the probability of two (three, four, five) 
randomly chosen genes that belong simultaneously to the same one of the 
five clusters and the probability of two (three, four, five) randomly chosen 
neighboring genes that belong simultaneously to the same one of the five 
clusters. Because the numbers in column 2 are smaller than those in column 
3, it seems that colocalization does exist. 

From the viewpoint of evolution, it might also be appealing to see if genes 
close to each other on the genome have a similar expression pattern. One 
relevant null hypothesis would be that there is no difference in the rank 
correlation of expression profiles from nearby genes and that from far away 
genes. For integer < Z < 76 = (153 — l)/2, let Nei(g,Z) denote the set of 
genes whose distance from gene g is no larger than Z; here the distance 
between two genes is the number of genes lying strictly between them. Let 
Rn(Z) denote the set of rank correlations of the time course expression 
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Fig. 5. Genome map view of the five gene clusters color tagged in the baculovirus genome. 
Red, green, blue, black and yellow represent respectively genes in the cluster 1, 2, 3, 4, 5 
in Figure 4- 

profile of a gene g and that of a gene in Nei(g, Z). Let RCn(Z) denote the 
set of rank correlations of the time course expression profile of a gene g and 
that of a gene not in Nei(g, Z). In terms of this notation, the null hypothesis 
becomes that there is no statistical difference between Rn(Zi) and RCn(Z2). 
We studied the hypothesis by the Wilcoxon statistic for many choices of Z\ 
and Z2- Table 7 reports the Wilcoxon statistics and their p- values for testing 
the corresponding one-sided null hypothesis for several choices of Z\ and 
/v2- It suggests that nearby genes do have a higher chance to have a similar 
expression pattern. 



Table 6 

The probability that N randomly chosen 
(neighboring) genes belong to the same cluster in 
Figure 4 



N 


Randomly chosen 


Neighboring 


2 


0.3835 


0.4837 


3 


0.1820 


0.2680 


4 


0.0926 


0.1373 


5 


0.0484 


0.0719 
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Table 7 

Comparing the rank correlation of the time course expression profiles from nearby genes 
and that from far away genes. Rn(Z) is the set of rank correlations for genes having no 
more than Z genes lying between them; HCn(Z) is that for genes having at least Z genes 

lying between them 



Rn(Z) 


RCn(Z) 


Wilcoxon statistic 


p- value 


Rn(2) 


RCn(12) 


8.81 


0.0000 


Rn(4) 


RCn(14) 


5.82 


0.0000 


Rn(6) 


RCn(16) 


5.09 


0.0000 


Rn(8) 


RCn(18) 


2.72 


0.0033 


Rn(10) 


RCn(20) 


2.01 


0.0221 


Rn(12) 


RCn(22) 


0.84 


0.2002 


Rn(14) 


RCn(24) 


1.25 


0.1051 


Rn(16) 


RCn(26) 


2.31 


0.0105 


Rn(18) 


RCn(28) 


1.59 


0.0558 


Rn(20) 


RCn(30) 


0.23 


0.4078 


Rn(22) 


RCn(32) 


0.55 


0.2913 


Rn(24) 


RCn(34) 


1.45 


0.0737 


Rn(26) 


RCn(36) 


0.25 


0.4014 



4. Discussion. We have illustrated a hierarchical Bayesian shape restricted 
regression method for the inference on the genome-wide time course expres- 
sion of virus genes and, based on the profiles provided by this method, we 
have examined salient features on the time course expression curves, studied 
some hypotheses on and thus brought insights into baculovirus biology. It is 
to be noted that our method helps to formulate biological questions quan- 
titatively so as to make modern statistics methods applicable. Although we 
looked at colocalization, the relation between upstream motifs and onset 
times, and that between area under curve and gene function, these are, nev- 
ertheless, preliminary investigations. Further studies are needed to give a 
more complete account of these aspects of the baculovirus. 

In view of the facts that genome-wide expression studies of virus genes 
are gaining popularity, all the previous works in this area use at most crude 
statistics for biological interpretation, and the existing discrepancies between 
the studies need to be resolved, we think our method is useful not only in one 
single expression study of virus genes but also in comparing these studies, 
which would enhance our understanding of the gene regulation network. We 
note that our method can be used to provide comprehensive comparison of 
the time course transcription profiles from different experiments when even 
their time points are not identical, as long as there are enough of them to 
capture their respective main features. 

As for future methodological development, we think the Bernstein-Dirichlet 
prior of Petrone (1999) and the related samplers are also useful in this con- 
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text; studies in this line and comparison with the approach in this paper 
deserve our attention. 

APPENDIX: METROPOLIS- WITHIN-GIBBS ALGORITHM FOR THE 

POSTERIOR 

Let B n = {b n G W 1 ' 1 : F c>bn G I n for some c G [0, 1)}. Denote (b 2 , n ,g, • • • , 
b n ,n, g ) = K,g by (a 2 , s , ■ . ■ , a n , s ) = a g . Let c = (ci,. . . ,ci 54 ); a= (a 1} . . . ,ai 54 ); 

U=(jUi,...,/ii54). 

Let B = {0,c,a,u | = (0i, 2 , 03, 04) G [2.2771,3.2771] x [1.9481, 
2.9481] x [0.5,1.5] x [0.5, 1.5], c 9 G [0, A 9 ],a 9 G B„,/i g G [0,2F 0g ]}. Our com- 
putational strategy consists of the following five MCMC algorithms to up- 
date 0, c, a and u consecutively. Let = (^*',c^',a",u^) G B denote 
the current state of the MCMC chain for sampling the posterior distribution. 

(i) Update 0i and 02 

1. Let 4>\ and 02 be two random samples from £7m/orm(0ii, 0i 2 ) and 
Uniform(<f>2i, 022) respectively; 

2. lety = (0 1 ,0 2 ,4* ) ,4 t) ,c(*),aW,u(*)); 

3. set 

(m)= L, withprob. p = min|l,-^y|, 

a;(*), otherwise. 



(ii) Update 03 and 0, 



1 



1. Let 03 and 04 be two random samples from Uniform ($31, 03 2 ) and 
Uniform (04i, 042 ) respectively; 

2. let y= (0^,0^,03,04,^), aW.uW); 

3. set 

(t+1)= L, withprob. p = min|l,-^yj, 

otherwise. 



.1; 



(iii) Update c 

There are 154 components (ci, . . . , C154) in c; we update them one at a 
time in the order of the coordinates. Suppose c(\...,Cg}_ l have been just 
updated and we now want to update c g . 

1. Let U be a random sample from 5eto(0^,0 2 *^); 

2. let c g = X g x U; let 7ri, s (c ff |0i , 2 ) denote the prior density 7Ti i3 of c 3 
given 0^ and 2 ; 

3. let y = (0W , cf } , . . . , c% c g 41,,..., cg 4 , a<*> , u W ) ; 
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4. set 

,(*)u(*) 

y, with prob. p = min< 1, 



x (t+l) 



k sc'*' , otherwise, 
(iv) Update a 

We update one coordinate of a each time in the order of the coordinates. 
Suppose we have updated a^ g , • • • , af_i g and we now want to update af g . 

1. Let V be a random sample from Beta((p^\(j)^); 

2. let a^g = 2y"( 4 [ 15 ]) 9 x V; let 7r3 jfl (5j ifl |03 , 4$) denote the prior density 
^3,0 ("l n ) °f the coefficient aj i9 given </> 3 and 04 ; 

3. let y be the same vector as except replacing afl by 5j 

4. set 



y, with prob. p = mm< 1, ■ 



V. x^) , otherwise, 
(v) Update u 

There are 154 components (/ii , . . . , //154) in u; we update them one at a 
time in the order. Suppose we have updated p® , . . . , p g _i and we now want 
to update pg . 

1. Let p g be a random sample from Uniform(0,2Yo g ); 

2. let y = (0W ,cW, a<*> , /if } , . . . , pf^ , p g , pf^ , . . . , pfl) ; 

3. set 

xi t+i) = ly, with prob. p = min|l,-^y|, 

t , otherwise. 
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SUPPLEMENTARY MATERIAL 

Profiling time course expression of a single virus gene 

(DOI: 10.1214/09- AOAS258SUPP; .pdf). This nonhierarchical Bayesian 
method, using also Bernstein polynomials, allows nontrivial prior probabil- 
ity on the order of the Bernstein polynomial and is amenable to simulation 
studies, which indicate its excellent numerical performance. 
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